Theoretical Formulation of Principal Components Analysis to Detect and Correct for Population Stratification
Abstract
The Eigenstrat method, based on principal components analysis (PCA), is commonly used both to quantify population relationships in population genetics and to correct for population stratification in genome-wide association studies. However, it can be difficult to draw appropriate inferences about population relationships from the principal component (PC) scatter plot. Here, to better understand how the Eigenstrat method works, we consider its theoretical or "population" formulation. The eigen-equation for samples from an arbitrary number (K) of populations is reduced to that of a matrix of dimension K, the elements of which are determined by the variance-covariance matrix for the random vector of the allele frequencies. Solving the reduced eigen-equation is numerically trivial and yields eigenvectors that are the axes of variation required for differentiating the populations. Using the reduced eigen-equation, we investigate the within-population fluctuations around the axes of variation on the PC scatter plot for simulated datasets. Specifically, we show that the PC plot settles into an asymptotically stable pattern as the sample size grows large. Our results provide theoretical guidance for interpreting the pattern of the PC plot in terms of population relationships. For applications in genetic association tests, we demonstrate that, as a method of correcting for population stratification, regressing out the theoretical PCs corresponding to the axes of variation is equivalent to simply removing the population mean of allele counts, and that it works as well as or better than the Eigenstrat method.
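The equivalence stated at the end of the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes that the theoretical PCs are constant within each population, so that together with an intercept they span the same space as K - 1 population indicator columns. The simulation settings (three populations, a normal perturbation of shared ancestral frequencies) and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_genotypes(freqs, n_per_pop, rng):
    """Draw allele counts (0, 1, 2) per population from its allele frequencies."""
    blocks = [rng.binomial(2, p, size=(n_per_pop, p.size)) for p in freqs]
    labels = np.repeat(np.arange(len(freqs)), n_per_pop)
    return np.vstack(blocks).astype(float), labels

def top_pcs(genotypes, n_pcs):
    """Leading sample-space PCs of the column-centered genotype matrix (via SVD)."""
    centered = genotypes - genotypes.mean(axis=0)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs]

def between_population_fraction(pc, labels):
    """Share of a PC's variance explained by population membership."""
    total = np.sum((pc - pc.mean()) ** 2)
    between = sum(
        np.sum(labels == k) * (pc[labels == k].mean() - pc.mean()) ** 2
        for k in np.unique(labels)
    )
    return between / total

def remove_population_means(genotypes, labels):
    """Subtract, marker by marker, the mean allele count of each sample's population."""
    out = genotypes.copy()
    for k in np.unique(labels):
        out[labels == k] -= genotypes[labels == k].mean(axis=0)
    return out

def regress_out(genotypes, covariates):
    """Residuals of every marker after least-squares regression on covariates plus intercept."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(design, genotypes, rcond=None)
    return genotypes - design @ beta

rng = np.random.default_rng(0)
n_pops, n_markers, n_per_pop = 3, 2000, 50

# Hypothetical drift model: each population perturbs shared ancestral frequencies.
ancestral = rng.uniform(0.1, 0.9, size=n_markers)
freqs = [
    np.clip(ancestral + rng.normal(0.0, 0.08, n_markers), 0.01, 0.99)
    for _ in range(n_pops)
]
G, labels = simulate_genotypes(freqs, n_per_pop, rng)

# Sample PCs: the leading ones should mostly reflect population membership.
pcs = top_pcs(G, n_pcs=n_pops - 1)
pc1_fraction = between_population_fraction(pcs[:, 0], labels)

# Stand-in for the theoretical PCs (assumption): K - 1 population indicator columns.
indicators = np.eye(n_pops)[labels][:, 1:]
resid_indicator = regress_out(G, indicators)
resid_means = remove_population_means(G, labels)

print("PC1 variance between populations:", round(float(pc1_fraction), 3))
print("Residuals identical:", np.allclose(resid_indicator, resid_means))
```

Under this assumption the two residual matrices agree to floating-point precision, because a regression on an intercept plus population indicators fits exactly the per-population mean of each marker. The sample PCs from the SVD, by contrast, carry within-population noise, which is the fluctuation around the axes of variation that the abstract refers to.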
Similar resources
Principal Components Analysis of Population Admixture
With the availability of high-density genotype information, principal components analysis (PCA) is now routinely used to detect and quantify the genetic structure of populations in both population genetics and genetic epidemiology. An important issue is how to make appropriate and correct inferences about population relationships from the results of PCA, especially when admixed individuals are ...
Effect of population stratification analysis on false-positive rates for common and rare variants
Principal components analysis (PCA) has been successfully used to correct for population stratification in genome-wide association studies of common variants. However, rare variants also have a role in common disease etiology. Whether PCA successfully controls population stratification for rare variants has not been addressed. Thus we evaluate the effect of population stratification analysis on...
Molecular Reclassification of Crohn’s Disease: A Cautionary Note on Population Stratification
Complex human diseases commonly differ in their phenotypic characteristics, e.g., Crohn's disease (CD) patients are heterogeneous with regard to disease location and disease extent. The genetic susceptibility to Crohn's disease is widely acknowledged and has been demonstrated by identification of over 100 CD associated genetic loci. However, relating CD subphenotypes to disease susceptible loci...
Assessing the impact of population stratification on association studies of rare variation.
Aims: The study of rare variants, which can potentially explain a great proportion of heritability, has emerged as an important topic in human gene mapping of complex diseases. Although several statistical methods have been developed to increase the power to detect disease-related rare variants, none of these methods address an important issue that often arises in genetic studies: false positive...
A method to correct for population structure using a segregation model
To overcome the "spurious" association caused by population stratification in population-based association studies, we propose a principal-component based method that can use both family and unrelated samples at the same time. More specifically, we adapt the multivariate logistic model, which is often used in segregation analysis and can allow for the family correlation structure, for associati...